About Data Analysis Report

This RMarkdown file contains the report of the data analysis performed for a project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics, and time series model building. The final report was completed on Wed Sep 11 11:49:41 2024.

Data Description:

This dataset contains the daily count of rental bike transactions between 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg

Task One: Load and explore the data

## Import required packages

## Loading and exploring the data
library(readr)
daily_data <- read_csv("C:/Users/srish/Downloads/day.csv")
## Rows: 731 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (15): instant, season, yr, mnth, holiday, weekday, workingday, weathers...
## date  (1): dteday
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_data<- read_csv("C:/Users/srish/Downloads/hour.csv")
## Rows: 17379 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (16): instant, season, yr, mnth, hr, holiday, weekday, workingday, weat...
## date  (1): dteday
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(daily_data)
## # A tibble: 6 × 16
##   instant dteday     season    yr  mnth holiday weekday workingday weathersit
##     <dbl> <date>      <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
## 1       1 2011-01-01      1     0     1       0       6          0          2
## 2       2 2011-01-02      1     0     1       0       0          0          2
## 3       3 2011-01-03      1     0     1       0       1          1          1
## 4       4 2011-01-04      1     0     1       0       2          1          1
## 5       5 2011-01-05      1     0     1       0       3          1          1
## 6       6 2011-01-06      1     0     1       0       4          1          1
## # ℹ 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>, windspeed <dbl>,
## #   casual <dbl>, registered <dbl>, cnt <dbl>
head(hourly_data)
## # A tibble: 6 × 17
##   instant dteday     season    yr  mnth    hr holiday weekday workingday
##     <dbl> <date>      <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>
## 1       1 2011-01-01      1     0     1     0       0       6          0
## 2       2 2011-01-01      1     0     1     1       0       6          0
## 3       3 2011-01-01      1     0     1     2       0       6          0
## 4       4 2011-01-01      1     0     1     3       0       6          0
## 5       5 2011-01-01      1     0     1     4       0       6          0
## 6       6 2011-01-01      1     0     1     5       0       6          0
## # ℹ 8 more variables: weathersit <dbl>, temp <dbl>, atemp <dbl>, hum <dbl>,
## #   windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>
str(daily_data)
## spc_tbl_ [731 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ instant   : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Date[1:731], format: "2011-01-01" "2011-01-02" ...
##  $ season    : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : num [1:731] 6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: num [1:731] 0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: num [1:731] 2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: num [1:731] 654 670 1229 1454 1518 ...
##  $ cnt       : num [1:731] 985 801 1349 1562 1600 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   instant = col_double(),
##   ..   dteday = col_date(format = ""),
##   ..   season = col_double(),
##   ..   yr = col_double(),
##   ..   mnth = col_double(),
##   ..   holiday = col_double(),
##   ..   weekday = col_double(),
##   ..   workingday = col_double(),
##   ..   weathersit = col_double(),
##   ..   temp = col_double(),
##   ..   atemp = col_double(),
##   ..   hum = col_double(),
##   ..   windspeed = col_double(),
##   ..   casual = col_double(),
##   ..   registered = col_double(),
##   ..   cnt = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(hourly_data)
## spc_tbl_ [17,379 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ instant   : num [1:17379] 1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Date[1:17379], format: "2011-01-01" "2011-01-01" ...
##  $ season    : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : num [1:17379] 0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : num [1:17379] 6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: num [1:17379] 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num [1:17379] 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num [1:17379] 0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num [1:17379] 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num [1:17379] 0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : num [1:17379] 3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: num [1:17379] 13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : num [1:17379] 16 40 32 13 1 1 2 3 8 14 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   instant = col_double(),
##   ..   dteday = col_date(format = ""),
##   ..   season = col_double(),
##   ..   yr = col_double(),
##   ..   mnth = col_double(),
##   ..   hr = col_double(),
##   ..   holiday = col_double(),
##   ..   weekday = col_double(),
##   ..   workingday = col_double(),
##   ..   weathersit = col_double(),
##   ..   temp = col_double(),
##   ..   atemp = col_double(),
##   ..   hum = col_double(),
##   ..   windspeed = col_double(),
##   ..   casual = col_double(),
##   ..   registered = col_double(),
##   ..   cnt = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(daily_data)
##     instant          dteday               season            yr        
##  Min.   :  1.0   Min.   :2011-01-01   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   1st Qu.:2011-07-02   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Median :2012-01-01   Median :3.000   Median :1.0000  
##  Mean   :366.0   Mean   :2012-01-01   Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5   3rd Qu.:2012-07-01   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0   Max.   :2012-12-31   Max.   :4.000   Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp             atemp              hum        
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714
summary(hourly_data)
##     instant          dteday               season            yr        
##  Min.   :    1   Min.   :2011-01-01   Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   1st Qu.:2011-07-04   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   Median :2012-01-02   Median :3.000   Median :1.0000  
##  Mean   : 8690   Mean   :2012-01-02   Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034   3rd Qu.:2012-07-02   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379   Max.   :2012-12-31   Max.   :4.000   Max.   :1.0000  
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0
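
The summaries above suggest the data are complete: 731 daily rows spanning 2011-01-01 to 2012-12-31 with no `NA`s reported. A quick programmatic sanity check along these lines (a sketch; the toy data frame below stands in for `daily_data`, which would normally come from `read_csv` above) could be:

```r
# toy stand-in for daily_data: a short run of consecutive dates with counts
daily_check <- data.frame(
  dteday = seq.Date(as.Date("2011-01-01"), by = "day", length.out = 10),
  cnt    = c(985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321)
)

# no missing values anywhere?
stopifnot(sum(is.na(daily_check)) == 0)

# dates strictly consecutive (no gaps or duplicated days)?
stopifnot(all(diff(daily_check$dteday) == 1))
```

Run against the real `daily_data`, the same two checks would confirm that no days are missing or duplicated before building a time series object.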

Task Two: Create interactive time series plots

# Interactive Plot for Daily Data
library(timetk)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
daily_data %>%
  plot_time_series(.date_var = dteday, .value = cnt,
                   .interactive = TRUE, .plotly_slider = TRUE,
                   .title = "Daily Bike Rentals",
                   .x_lab = "Date", .y_lab = "Number of Rentals")
# Interactive Plot for Hourly Data

hourly_data %>%
  plot_time_series(.date_var = dteday, .value = cnt,
                   .interactive = TRUE, .plotly_slider = TRUE,
                   .title = "Hourly Bike Rentals",
                   .x_lab = "Datetime", .y_lab = "Number of Rentals")
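
Note that the hourly plot above is given `dteday`, a date, so all 24 hourly observations of a day share the same x position. A proper timestamp can be built by combining `dteday` and `hr` (a base-R sketch; `datetime` and `hourly_demo` are names introduced here, and the toy frame stands in for `hourly_data`):

```r
# toy stand-in for hourly_data: one day of hourly rows
hourly_demo <- data.frame(
  dteday = as.Date("2011-01-01"),
  hr     = 0:23,
  cnt    = 1:24
)

# midnight of each date plus hr hours, as a POSIXct timestamp
hourly_demo$datetime <- as.POSIXct(hourly_demo$dteday, tz = "UTC") +
  hourly_demo$hr * 3600
```

Plotting with `.date_var = datetime` would then spread the hourly counts across each day instead of stacking them on a single date.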

Task Three: Smooth time series data

# calculating moving average of daily bike rentals
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
smooth_d <- ma(daily_data$cnt, order = 7)  # centered 7-day moving average

#creating a ggplot object
library(ggplot2)  # already attached (loaded by plotly above); re-attaching is harmless
ggplot(daily_data, aes(x = dteday, y = cnt)) +
  geom_line(color = "blue") +
  geom_line(aes(y = smooth_d), color = "red") +
  labs(title = "Original and Smoothed Daily Bike Rentals",
       x = "Date", y = "Number of Rentals") +
  theme_classic()
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_line()`).
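
`forecast::ma(order = 7)` computes a centered 7-term moving average, which is why the first and last three values are `NA` (hence the warning about 6 removed rows). The same smoother can be written with base R's `stats::filter` (a sketch on a toy vector):

```r
x <- 1:20

# centered 7-term moving average: equal weights of 1/7, sides = 2 centers the window
m <- stats::filter(x, rep(1/7, 7), sides = 2)

m[1:3]  # NA NA NA -- the 7-point window does not fit at the edges
m[4]    # mean(1:7) = 4
```

For the linear ramp `1:20` every interior smoothed value equals the original point, which makes the edge `NA`s easy to see.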

Task Four: Decompose and assess the stationarity of time series data

# creating a time series object

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
daily_ts <- ts(daily_data$cnt, start = c(2011, 1), frequency = 365)  # start = c(2011, 1): day 1 of 2011
daily_dates <- seq.Date(from = as.Date("2011-01-01"), by = "day", length.out = length(daily_data$cnt))


# decomposing daily_ts into seasonal, trend, residual components
decomp <- stl(daily_ts, s.window = "periodic")

# plotting
plot(decomp)
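
Beyond eyeballing the decomposition plot, the relative size of the STL components can be quantified with the usual strength measures: one minus the variance of the remainder over the variance of remainder plus component. A self-contained sketch on R's built-in `co2` series (monthly; the same code applies to `daily_ts` with frequency 365):

```r
# STL decomposition of a built-in monthly series
fit <- stl(co2, s.window = "periodic")

seasonal  <- fit$time.series[, "seasonal"]
trend     <- fit$time.series[, "trend"]
remainder <- fit$time.series[, "remainder"]

# strength measures lie in [0, 1]; values near 1 mean the component dominates
trend_strength    <- max(0, 1 - var(remainder) / var(trend + remainder))
seasonal_strength <- max(0, 1 - var(remainder) / var(seasonal + remainder))

c(trend = trend_strength, seasonal = seasonal_strength)
```

For the bike data this gives a number to back the visual impression that both trend and seasonality are substantial.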

# running the ADF Test to assess stationarity 
library(tseries)
adf.test(daily_ts) # results = not stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  daily_ts
## Dickey-Fuller = -1.6351, Lag order = 9, p-value = 0.7327
## alternative hypothesis: stationary
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
# aggregating daily counts to monthly totals (24 observations)
monthly_ts <- aggregate(daily_data$cnt, by = list(format(daily_dates, "%Y-%m")), FUN = sum)

# making it stationary
## log transformation
monthly_ts_log <- log(monthly_ts$x)

adf.test(monthly_ts_log) ### p = 0.086 > 0.05, so not stationary at the 5% level
## 
##  Augmented Dickey-Fuller Test
## 
## data:  monthly_ts_log
## Dickey-Fuller = -3.3434, Lag order = 2, p-value = 0.08563
## alternative hypothesis: stationary
## differencing
monthly_ts_diff <- diff(monthly_ts$x)

adf.test(monthly_ts_diff) ### p = 0.572, still not stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  monthly_ts_diff
## Dickey-Fuller = -2.001, Lag order = 2, p-value = 0.572
## alternative hypothesis: stationary
## combining methods 

monthly_ts_diff_log <- diff(log(monthly_ts$x))
adf.test(monthly_ts_diff_log) ### p = 0.383, still not rejected; with only 24 monthly points the ADF test has low power
## 
##  Augmented Dickey-Fuller Test
## 
## data:  monthly_ts_diff_log
## Dickey-Fuller = -2.4962, Lag order = 2, p-value = 0.3834
## alternative hypothesis: stationary
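
With only 24 monthly observations, the ADF test has little power, which helps explain the inconclusive results above. On a longer series, first differencing alone is often enough; a self-contained sketch on a simulated random walk:

```r
library(tseries)

set.seed(42)

# a random walk is non-stationary; its first difference is white noise
rw <- cumsum(rnorm(500))

adf.test(rw)        # typically a large p-value: cannot reject non-stationarity
adf.test(diff(rw))  # small p-value: the differenced series is stationary
```

Applying the same idea to the 731-point `daily_ts` (rather than the 24 monthly totals) would give the ADF test far more to work with.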

Task Five: Fit and forecast time series data using ARIMA models

# splitting the data into training and testing sets
# (a chronological split keeps the temporal order; randomly sampling rows
# would leak future observations into the training set)

train_size <- floor(0.8 * nrow(daily_data))
train_data <- daily_data[1:train_size, ]
test_data <- daily_data[(train_size + 1):nrow(daily_data), ]

# building an ARIMA model (auto.arima selects the orders by BIC; this fit is
# replaced by the manual ARIMA(1,1,1) below)
library(forecast)
arima_model <- auto.arima(train_data$cnt, ic = "bic")

# evaluate the model using test data

## stats is part of base R and attached by default, so this call is optional
library(stats)

## defining p, d, q
p <- 1  # AR order
d <- 1  # differencing order
q <- 1  # MA order


arima_model <- arima(train_data$cnt, order = c(1, 1, 1))
summary(arima_model)
## 
## Call:
## arima(x = train_data$cnt, order = c(1, 1, 1))
## 
## Coefficients:
##           ar1      ma1
##       -0.0638  -1.0000
## s.e.   0.0414   0.0048
## 
## sigma^2 estimated as 3655586:  log likelihood = -5235.56,  aic = 10477.12
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE    MAPE      MASE
## Training set -34.18823 1910.321 1559.299 -69.81913 92.5421 0.6893145
##                     ACF1
## Training set 0.001001772
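
The summary above reports only training-set error measures; the held-out test set is never actually forecast. Holdout accuracy can be computed with `predict` from stats (a self-contained sketch on a simulated AR(1) series; `h` and `rmse` are names introduced here):

```r
set.seed(123)

# simulate an AR(1) series and hold out the last h points
y <- arima.sim(model = list(ar = 0.7), n = 200)
h <- 20
train <- y[1:(200 - h)]
test  <- y[(200 - h + 1):200]

# fit an AR(1) on the training window and forecast h steps ahead
fit <- arima(train, order = c(1, 0, 0))
fc  <- predict(fit, n.ahead = h)$pred

# holdout RMSE
rmse <- sqrt(mean((test - fc)^2))
rmse
```

For the real data, the same pattern applies with `train_data$cnt` and `test_data$cnt` in place of the simulated series, giving an out-of-sample error to compare against the training-set figures above.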

Task Six: Findings and Conclusions

In conclusion, I investigated the daily bike rental demand dataset in this project using a variety of time series analysis techniques in R. The analysis indicates that the data has significant seasonal and trend components, which were separated out with STL decomposition and then modeled with ARIMA.

Task 1: Load and Examine the Data: The hourly and daily data were successfully loaded and examined, revealing a clear seasonal pattern in the data.

Task 2: Create Interactive Time Series Plots: To better understand the structure of the data, interactive plots were made to visualize the hourly and daily data.

Task 3: Smooth Time Series Data: A 7-day centered moving average highlighted the underlying trend in the daily data.

Task 4: Decompose and Assess Time Series Data Stationarity: After the seasonal, trend, and residual components were separated out, the ADF test showed that the data was not stationary. Log transformation and differencing moved the monthly series toward stationarity, although the ADF test still could not reject non-stationarity at the 5% level.

Task 5: Fit and Forecast Time Series Data Using ARIMA Models: An ARIMA(1,1,1) model was built and assessed; its training-set error measures suggest a reasonable but imperfect fit.

During this project, I learned the importance of:

  • Investigating and comprehending the patterns and structure of the data
  • Using the right methods to decompose and smooth time series data
  • Determining stationarity and putting transformations into practice to get there
  • Constructing and assessing ARIMA time series forecasting models

Given the significant seasonal and trend components in the data, the results were generally as anticipated. The ARIMA model captured these patterns, and the forecasts were plausible.

My key takeaways were:

  • There are significant seasonal and trend components in the daily bike rental demand dataset.
  • With STL decomposition, the data can be effectively divided into seasonal, trend, and residual components.
  • Log transformation and differencing move the data toward stationarity, though the short monthly series limits what the ADF test can confirm.
  • An ARIMA model can provide a reasonable baseline forecast for daily bike rental demand.
  • In time series analysis, it is crucial to investigate and comprehend the patterns and structure of the data.

Overall, this project showed the value of using R’s time series analysis tools and offered insightful analysis of the daily bike rental demand dataset.